Project Summary

For our network analysis project, we decided to explore the connections between Spotify artists to understand how musical collaboration and similarity create networks in the music industry. We obtained a dataset containing artist nodes and their relationships, which allowed us to examine collaboration patterns and identify influential artists who serve as bridges between different musical communities.

Our analysis focused on visualizing the network structure, computing degree centrality to identify the most connected artists, and running Louvain community detection to see how artists cluster together. We were particularly interested in which artists have the most connections and what the overall structure reveals about musical collaboration. The network turned out to be very sparse, with a small number of highly connected artists and a few large communities alongside thousands of small ones.

Setup and Data Description

# Load all the packages we need for network analysis
library(tidyverse)    # for data manipulation and visualization
library(igraph)       # main network analysis package
library(ggraph)       # for pretty network plots
library(ggrepel)      # to avoid overlapping labels
library(knitr)        # for nice tables
# library(DT)           # for interactive tables - commented out for PDF compatibility

About the Dataset

# Read in our data files
nodes <- read_csv("nodes.csv")
edges <- read_csv("edges.csv")

# Let's see what we're working with
cat("Node column names:\n")
## Node column names:
print(names(nodes))
## [1] "spotify_id" "name"       "followers"  "popularity" "genres"    
## [6] "chart_hits"
cat("\nEdge column names:\n")
## 
## Edge column names:
print(names(edges))
## [1] "id_0" "id_1"

This dataset represents a network of Spotify artists where:

  • Nodes (vertices): Individual artists on Spotify
  • Edges: Relationships between artists (could represent collaborations, similar musical styles, or listener overlap)
  • Data Source: The data was collected from Spotify’s API and artist relationship databases
  • Collection Period: Data reflects artist relationships as of 2023-2024
  • Access: Dataset can be found at Spotify Artist Network Dataset

The network allows us to study how artists connect to each other and identify important artists who bridge different musical communities.

Data Preprocessing

# First we need to clean up any duplicate artists
id_column <- names(nodes)[1]
cat("Using", id_column, "as our artist ID column\n")
## Using spotify_id as our artist ID column
# Check how many duplicates we have
cat("Total nodes:", nrow(nodes), "\n")
## Total nodes: 156422
cat("Unique nodes:", length(unique(nodes[[id_column]])), "\n")
## Unique nodes: 156320
# Remove duplicates - keep the first occurrence
nodes_clean <- nodes[!duplicated(nodes[[id_column]]), ]
cat("Nodes after cleaning:", nrow(nodes_clean), "\n")
## Nodes after cleaning: 156320
# Make sure all artists in our edge list actually exist in our node list
edge_artists <- unique(c(edges[[1]], edges[[2]]))
node_artists <- nodes_clean[[1]]

# Find any artists mentioned in edges but missing from nodes
missing_in_nodes <- setdiff(edge_artists, node_artists)
cat("Artists in edges but not in nodes:", length(missing_in_nodes), "\n")
## Artists in edges but not in nodes: 6
# Clean edges to only include artists we have data for
edges_clean <- edges[edges[[1]] %in% node_artists & edges[[2]] %in% node_artists, ]
cat("Edges before filtering:", nrow(edges), "\n")
## Edges before filtering: 300386
cat("Edges after filtering:", nrow(edges_clean), "\n")
## Edges after filtering: 300379
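
The filtering above removes edges that point at unknown artists, but it does not look for self-loops or for the same pair listed twice in opposite order. Either would inflate degree counts in an undirected graph. The following is a minimal sketch of such a check on a made-up edge list; `edges_toy` and its values are hypothetical, and whether the real `edges.csv` contains any such rows is not known from the output shown here.

```r
# Hypothetical toy edge list using the same column names as edges.csv
edges_toy <- data.frame(
  id_0 = c("a", "b", "c", "a", "d"),
  id_1 = c("b", "a", "c", "c", "a"),
  stringsAsFactors = FALSE
)

# A self-loop connects an artist to itself
is_loop <- edges_toy$id_0 == edges_toy$id_1

# Put each pair in a canonical order so that a-b and b-a compare equal
lo <- pmin(edges_toy$id_0, edges_toy$id_1)
hi <- pmax(edges_toy$id_0, edges_toy$id_1)
is_dup <- duplicated(data.frame(lo, hi))

# Keep only the first copy of each pair and drop self-loops
edges_simple <- edges_toy[!is_loop & !is_dup, ]
cat("Self-loops:", sum(is_loop), "\n")
cat("Repeated pairs:", sum(is_dup), "\n")
cat("Edges kept:", nrow(edges_simple), "\n")
```

Once the graph object exists, igraph's `simplify()` performs the same cleanup (it removes multi-edges and loops by default).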

Network Creation and Basic Analysis

# Now we can create our network graph object
cat("Building the network graph...\n")
## Building the network graph...
graph <- graph_from_data_frame(d = edges_clean, vertices = nodes_clean, directed = FALSE)
cat("Network created successfully!\n")
## Network created successfully!
# Calculate some important network measures
# Degree centrality shows how many connections each artist has
V(graph)$degree <- degree(graph)

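# Community detection to find groups of closely connected artists
# (Louvain processes vertices in random order in recent igraph versions,
# so fix the seed to keep community IDs repeatable between runs)
set.seed(123)
communities <- cluster_louvain(graph)
V(graph)$community <- communities$membership

# Check if the network is all connected or has separate pieces
# (named comps so the variable does not mask igraph's components() function)
comps <- components(graph)
V(graph)$component <- comps$membership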
# Create a summary table of our network's basic properties
# Values are formatted as text so that counts print without decimals and
# the very small density is not rounded to zero
network_stats <- data.frame(
  Metric = c("Total Artists", "Total Connections", "Number of Communities", 
             "Average Connections per Artist", "Network Density", "Number of Components"),
  Value = c(
    format(vcount(graph), scientific = FALSE),
    format(ecount(graph), scientific = FALSE),
    format(max(V(graph)$community), scientific = FALSE),
    format(round(mean(V(graph)$degree), 2), nsmall = 2),
    formatC(edge_density(graph), format = "e", digits = 2),
    format(max(V(graph)$component), scientific = FALSE)
  )
)

kable(network_stats, caption = "Basic Network Statistics")
Basic Network Statistics

Metric                          Value
Total Artists                   156320
Total Connections               300379
Number of Communities           4516
Average Connections per Artist  3.84
Network Density                 2.46e-05
Number of Components            4338

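The network statistics describe the overall structure of our artist network. It is extremely sparse: 300,379 connections among 156,320 artists give a density of roughly 2.5 × 10^-5, meaning only about 1 in 40,000 possible artist pairs is connected. It is also fragmented, with 4,338 separate components, and the Louvain algorithm found 4,516 communities.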
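
As a quick check, the density and the average degree can be recomputed by hand from the vertex and edge counts alone. This sketch assumes a simple undirected graph (no self-loops or repeated edges), which is the case igraph's `edge_density()` handles by default.

```r
# Values copied from the Basic Network Statistics table
n <- 156320   # artists (vertices)
m <- 300379   # connections (edges)

possible_pairs <- n * (n - 1) / 2   # every unordered pair of artists
density <- m / possible_pairs       # share of pairs that are connected
mean_degree <- 2 * m / n            # each edge adds 1 to two artists' degrees

cat("Density:", signif(density, 3), "\n")
cat("Mean degree:", round(mean_degree, 2), "\n")
```

Because the denominator grows with the square of the number of artists, large real-world networks almost always have densities this close to zero, so the average degree is usually the more readable figure.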

Network Visualizations

# For better visualization, we'll focus on the most connected artists
min_degree <- 3
high_degree_nodes <- which(V(graph)$degree >= min_degree)

cat("Creating focused view with artists having", min_degree, "or more connections\n")
## Creating focused view with artists having 3 or more connections
cat("This gives us", length(high_degree_nodes), "artists to visualize\n")
## This gives us 30548 artists to visualize
# If we still have too many nodes, we'll take the top ones by degree
if(length(high_degree_nodes) > 1000) {
  top_nodes <- order(V(graph)$degree, decreasing = TRUE)[1:1000]
  subgraph <- induced_subgraph(graph, top_nodes)
  cat("Showing top 1000 most connected artists\n")
} else if(length(high_degree_nodes) > 0) {
  subgraph <- induced_subgraph(graph, high_degree_nodes)
} else {
  # Fallback: random sample if no highly connected artists
  sample_size <- min(500, vcount(graph))
  sampled_nodes <- sample(V(graph), sample_size)
  subgraph <- induced_subgraph(graph, sampled_nodes)
  cat("Using random sample of", sample_size, "artists\n")
}
## Showing top 1000 most connected artists

Main Network Visualization

set.seed(123)  # for reproducible layout

# Create our main network visualization
network_plot <- ggraph(subgraph, layout = "fr") +
  geom_edge_link(alpha = 0.2, color = "gray", width = 0.5) +
  geom_node_point(aes(size = degree, color = as.factor(community)), alpha = 0.7) +
  scale_size_continuous(range = c(1, 6), name = "Number of\nConnections") +
  scale_color_discrete(name = "Community") +
  theme_void() +
  theme(legend.position = "bottom") +
  labs(title = "Spotify Artist Collaboration Network",
       subtitle = paste("Showing", vcount(subgraph), "most connected artists"))

print(network_plot)
Figure: Spotify Artist Network showing different communities and connection strengths

This visualization shows how artists cluster into communities (shown by color) and reveals which artists have the most connections (shown by node size). The force-directed layout helps us see the natural groupings in the network.

Simplified Network View

# Sometimes a simpler view is clearer
simple_plot <- ggraph(subgraph, layout = "fr") +
  geom_edge_link(alpha = 0.3, color = "lightgray") +
  geom_node_point(aes(size = degree), color = "steelblue", alpha = 0.7) +
  scale_size_continuous(range = c(2, 8), name = "Connections") +
  theme_void() +
  labs(title = "Artist Network (Simplified View)")

print(simple_plot)
Figure: Clean view of artist connections without community colors

Highly Connected Artists

# Let's look at just the super-connected artists
highly_connected <- which(V(graph)$degree > 10)
if(length(highly_connected) > 0) {
  custom_subgraph <- induced_subgraph(graph, highly_connected)
  
  # Circular layout works well for smaller networks
  influential_plot <- ggraph(custom_subgraph, layout = "circle") +
    geom_edge_link(alpha = 0.3, color = "darkgray") +
    geom_node_point(aes(size = degree, color = as.factor(community)), alpha = 0.8) +
    geom_node_text(aes(label = ifelse(degree > 20, name, '')), 
                   size = 2, repel = TRUE, max.overlaps = 10) +
    scale_color_viridis_d(name = "Community") +
    scale_size_continuous(range = c(3, 10), name = "Connections") +
    theme_void() +
    labs(title = "Most Influential Artists in the Network",
         subtitle = "Artists with more than 10 connections")
  
  print(influential_plot)
} else {
  cat("No artists found with more than 10 connections for this visualization\n")
}
Figure: Circular layout highlighting the most influential artists

Detailed Analysis Results

Most Connected Artists

# Let's see who the most connected artists are
node_metrics <- data.frame(
  Artist = V(graph)$name,
  Connections = V(graph)$degree,
  Community = V(graph)$community,
  Component = V(graph)$component
)

# Show the top 20 most connected artists
top_artists <- head(node_metrics[order(-node_metrics$Connections), ], 20)

# Use kable instead of DT for PDF compatibility
kable(
  top_artists, 
  caption = "Top 20 Most Connected Artists",
  row.names = FALSE
)
Top 20 Most Connected Artists

Artist                  Connections  Community  Component
Johann Sebastian Bach          1781          9          1
Traditional                    1371          9          1
Mc Gw                           858         26          1
MC MN                           632         26          1
Jean Sibelius                   580          9          1
Armin van Buuren                513         19          1
Gucci Mane                      509         19          1
Steve Aoki                      498         19          1
Snoop Dogg                      495         19          1
Diplo                           494         19          1
Tiësto                          475         19          1
A.R. Rahman                     463         71          1
Mc Delux                        461         26          1
David Guetta                    452         19          1
Mc Rd                           440         26          1
R3HAB                           420         19          1
John Williams                   415          9          1
הכוכב הבא                       377         42          1
Pritam                          375         71          1
DJ Antoine                      372         41          1

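The most connected artists fall into a few recognizable groups: classical composers and the catch-all credit “Traditional” (community 9), electronic and hip-hop artists such as Armin van Buuren, Gucci Mane, and Steve Aoki (community 19), and Indian film composers A.R. Rahman and Pritam (community 71). High degree marks these artists as “hubs”, but degree only counts connections; it does not show whether an artist links different communities. Entries such as Johann Sebastian Bach and “Traditional” probably reflect composer credits on many recordings rather than active collaboration.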
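
Whether a hub actually bridges groups is usually measured with betweenness centrality, which this report does not compute. The sketch below uses a hypothetical seven-node toy graph (not the Spotify data) to show that the node with the highest betweenness can have one of the lowest degrees.

```r
library(igraph)

# Hypothetical toy network: two triangles (a-b-c and e-f-g) joined through d
g <- graph_from_literal(a - b, b - c, a - c, c - d, d - e, e - f, f - g, e - g)

deg <- degree(g)
btw <- betweenness(g, directed = FALSE)

# d has only two connections, yet every shortest path between the
# two triangles runs through it
print(data.frame(degree = deg, betweenness = btw))
```

On a graph with more than 150,000 vertices, exact betweenness may take a long time. Restricting the calculation to shorter paths with the `cutoff` argument, or to the largest component, would be one way to keep it manageable.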

Community Structure Analysis

# Analyze the communities we found
community_sizes <- node_metrics %>%
  group_by(Community) %>%
  summarise(
    Artists_in_Community = n(),
    Average_Connections = round(mean(Connections), 2),
    # Highest degree within the community (a count, not an artist name)
    Max_Connections = max(Connections)
  ) %>%
  arrange(desc(Artists_in_Community))

# Show the largest communities
kable(
  head(community_sizes, 10),
  caption = "Largest Artist Communities"
)
Largest Artist Communities

Community  Artists_in_Community  Average_Connections  Max_Connections
       19                 24536                 5.83              513
        7                 13590                 5.27              346
       26                  8662                 4.87              858
        9                  6826                 2.38             1781
       71                  6305                 4.04              463
       41                  5606                 3.06              372
       45                  4487                 4.30              248
        1                  4377                 3.44              123
        2                  4361                 4.06              231
       36                  4023                 3.08              260

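The community analysis helps us understand how artists group together. The ten largest communities hold 82,773 of the 156,320 artists (about 53%). Community size and hub size are not tightly linked: community 9 contains the highest-degree artist (1,781 connections) but has the lowest average among the ten (2.38), which points to a star-like structure around a few hubs, whereas community 19 is both the largest and the most connected on average (5.83). Larger communities might represent major musical genres or collaborative networks, while smaller communities could be niche genres or tight-knit artist groups.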
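
The summary above reports the highest degree in each community but not which artist holds it. A minimal sketch of how the name could be pulled out with dplyr follows; `toy_metrics` is a hypothetical stand-in that mirrors the column names of `node_metrics`, with made-up values.

```r
library(dplyr)

# Hypothetical stand-in for node_metrics (same column names, made-up values)
toy_metrics <- data.frame(
  Artist      = c("A", "B", "C", "D", "E"),
  Connections = c(12, 3, 7, 7, 1),
  Community   = c(1, 1, 2, 2, 2)
)

top_per_community <- toy_metrics %>%
  group_by(Community) %>%
  # Keep exactly one row per community, even when degrees are tied
  slice_max(Connections, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(Community, Top_Artist = Artist, Max_Connections = Connections)

print(top_per_community)
```

Setting `with_ties = FALSE` keeps the result at one row per community. With the default (`TRUE`), tied artists would all be returned, which may be preferable when ties are of interest.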

Connection Distribution

# Look at the overall pattern of connections
ggplot(node_metrics, aes(x = Connections)) +
  geom_histogram(bins = 30, fill = "steelblue", alpha = 0.7, color = "white") +
  scale_x_log10() +
  labs(
    title = "Distribution of Artist Connections",
    x = "Number of Connections (log scale)",
    y = "Number of Artists"
  ) +
  theme_minimal()
Figure: How artist connections are distributed across the network

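A heavy-tailed distribution, where most artists have few connections and a small number have very many, is typical of “scale-free” networks (common in social networks). Our statistics are consistent with that pattern, with a mean of 3.84 connections against a maximum of 1,781, but a histogram alone cannot confirm a power law. Because the x-axis is logarithmic, any artists with zero connections are dropped from the plot.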
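
A common complement to the histogram is the empirical complementary CDF drawn on log-log axes, where a power-law tail appears as a roughly straight line. The sketch below is one way to compute it in base R; `degree_ccdf` is a hypothetical helper and `toy_deg` holds made-up values for illustration only.

```r
# Empirical complementary CDF: share of artists with at least k connections
degree_ccdf <- function(deg) {
  k <- sort(unique(deg[deg > 0]))   # log axes cannot show zero
  share <- vapply(k, function(x) mean(deg >= x), numeric(1))
  data.frame(k = k, share = share)
}

# Hypothetical degree values, for illustration only
toy_deg <- c(1, 1, 1, 1, 2, 2, 3, 5, 8, 20)
ccdf <- degree_ccdf(toy_deg)
print(ccdf)

# On log-log axes a power-law tail would look roughly like a straight line
plot(ccdf$k, ccdf$share, log = "xy", type = "b",
     xlab = "Connections (k)",
     ylab = "Share of artists with >= k connections")
```

With the report's data the call would be `degree_ccdf(node_metrics$Connections)`. A straight-looking tail is suggestive rather than conclusive; a formal claim would need a fitted model and a goodness-of-fit test.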

# Save our visualizations for the presentation
ggsave("spotify_network_main.png", plot = network_plot, 
       width = 12, height = 10, dpi = 300, bg = "white")
ggsave("spotify_network_simple.png", plot = simple_plot, 
       width = 12, height = 10, dpi = 300, bg = "white")

# Export our analysis results
write_csv(node_metrics, "spotify_artist_metrics.csv")

cat("Saved files:\n")
cat("- Main network visualization\n")
cat("- Simplified network plot\n") 
cat("- Artist metrics spreadsheet\n")

Conclusions and Key Findings

Our analysis of the Spotify artist network revealed several patterns in how artists connect in the music industry. The network contains 156,320 artists connected through 300,379 relationships, an average of 3.84 connections per artist. Louvain community detection found 4,516 communities, but the network also splits into 4,338 disconnected components, so most of those communities are simply small isolated groups. The ten largest communities together hold about 53% of all artists and likely represent different genres, collaborative circles, or regional music scenes.

The most significant finding is the presence of highly connected “hub” artists, with degrees in the hundreds or thousands in a network where the average is below four. These hubs are candidates for the bridges that connect different musical communities, although degree alone does not establish that role; confirming it would require a measure such as betweenness centrality or a count of each hub's cross-community edges. The visualizations show both the tight-knit nature of individual communities and the sparser links between them. This pattern suggests that while artists tend to work within specific musical circles, a smaller number of well-connected artists link these circles and may carry musical influence across genres and communities.


Network analysis completed on 2025-06-06